This report examines interesting relationships found in the chemical properties of red wine and how they relate to the quality of the wine.
##        X          fixed.acidity   volatile.acidity   citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200     Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900     1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200     Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278     Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400     3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800     Max.   :1.000
##  residual.sugar     chlorides         free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density            pH            sulphates
##  Min.   :  6.00        Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00        1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00        Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47        Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00        3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00        Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
The dataset for this report is gathered from Cortez et al. and is a collection of various chemical descriptors of individual red wines along with their quality as rated by a panel of three wine tasting experts. Quality is scored on a scale of 0-10, with the recorded statistic being the median of the panel’s choices. Besides this one qualitative statistic, the rest of the statistics are quantitative measures of the chemical makeup of each wine, e.g. acidity, pH, alcohol content, etc.
To begin, it would be helpful to know what the distribution of quality of wines actually is. How often are wines considered good/bad or just “meh”? Let’s take a look:
Interestingly enough, there don’t seem to be any wines that knocked it out of the park with a 9-10 rating or were borderline undrinkable with a 1-2 rating. All of the wines seem to be roughly normally distributed in a range of 3-8 with a mean of approximately 5.64. Given this scale, most of the wines in this sample are considered “average” by the panel of experts. (As food for thought, it might be interesting to see if this distribution changes if a panel of casual wine drinkers were surveyed instead.)
This initial distribution suggests that it may be hard to detect wines at the outer ends of the spectrum, i.e. wines with a quality of 3-4 or 7-8, given their relative infrequency.
Looking at some of the other distributions:
As a quick chemistry refresher, the pH scale measures acidity with a range of 0 to 14, with 0 being the most acidic (lemon juice, vinegar, etc.) and 14 being the most basic (ammonia). From the distribution, it is clear that acidity tends not to vary a whole lot among red wines, as they are roughly normally distributed around a mean pH of 3.31. Given this distribution, it seems like your average red will have a pH a little above 3, which is fairly acidic!
Next, let’s see if sugar or salt content have any meaningful distributions:
## [1] 0.9 15.5
## [1] 0.012 0.611
A cursory look at these two distributions reveals a long-tailed, positively skewed distribution for each. The range of residual sugar seems to be fairly large, with a minimum of 0.9 g/dm^3 and a max of a whopping 15.5 g/dm^3. Similarly, chlorides seem to be fairly low with the exception of a few wines. A transformation of the axis, such as a log scale, could help zoom these plots in on what appears to be the majority of these data.
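To illustrate why a log scale helps with long tails like these, here is a minimal Python sketch. The report's own plotting code is not shown, so the language and all variable names are assumptions. It uses only the residual sugar values from the summary table at the top of this report and compares how lopsided the spread around the median is before and after a log10 transform.

```python
import math

# Summary values for residual sugar (g/dm^3), copied from the
# summary table at the top of this report.
sugar_summary = {"min": 0.9, "q1": 1.9, "median": 2.2, "q3": 2.6, "max": 15.5}

# log10 compresses large values much more than small ones,
# while keeping the values in the same order.
log_summary = {name: math.log10(value) for name, value in sugar_summary.items()}


def tail_ratio(summary):
    """Width of the upper half of the range divided by the lower half."""
    upper = summary["max"] - summary["median"]
    lower = summary["median"] - summary["min"]
    return upper / lower


raw_tail_ratio = tail_ratio(sugar_summary)
log_tail_ratio = tail_ratio(log_summary)

print(f"raw tail ratio:   {raw_tail_ratio:.2f}")
print(f"log10 tail ratio: {log_tail_ratio:.2f}")
```

A ratio closer to 1 means the upper and lower halves of the range take up a similar share of the axis, which is what lets the bulk of the wines spread out instead of being squeezed against the left edge.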
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
From these distributions, it appears that red wines are typically very dry (less sweet), as indicated by the low sugar content. Chlorides are also present in relatively low amounts and clumped near the mean of 0.087.
Overall, there seems to be little variance among the majority of wines with regard to sugar and salt content, so these may not be the go-to variables for predicting the quality of wine. That is not to say sugar and chlorides aren’t potentially good determinants of quality, but some further exploration of the other variables would be a good idea. Ideally, we can find a few variables that might serve as good predictors that we can then combine in bivariate and multivariate plots.
From our earlier finding that most wines are acidic with a pH between 3 and 4, we can look at some subsets of acidity to determine if various types of acidity could potentially have a relationship with quality. From the dataset, we have three categories and their provided descriptions that describe acidity in some way:
Fixed acidity is defined as the tartaric acid in the wine and volatile acidity is the acetic acid in the wine. Based on the volatile acidity description, it might be the case that some wines with higher amounts of acetic acid would be rated lower quality. I would venture to guess that fixed, or “non-volatile”, acidity should be close to the inverse of volatile acidity, as tartaric acid makes up the majority of the non-volatile acids present in the wine. As far as citric acid goes, it’s a bit harder to guess what the effect may be without some domain knowledge. Colloquially, red wines are not necessarily known for their “freshness” and, given the small amounts present in wines in general, I would not expect there to be much variance in its distribution in red wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Again, the distributions seem fairly normal for fixed and volatile acidity, with the latter having an odd outlier at 1.58. Ultimately, it’s hard to say how these distributions might relate to quality given their low apparent variance. We may need to revisit them as part of a bivariate analysis with quality.
Citric acid has an interesting distribution that features a lot more variance. There is an interesting dip at around 0.2 g/dm^3 and a spike at just below 0.5 g/dm^3. In isolation, it is hard to tell the significance of this bimodal distribution without further domain knowledge or a reference point but this distribution is still interesting and could be a worthy candidate for further exploration in our bivariate plots.
Finally, we will take a look at the distribution of alcohol:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol content in these red wines ranges from 8.4% to 14.9%, with most coming in around 9-11%. Drinks with alcohol content over 13-14% are typically considered pretty strong, so it’s no surprise that these wines are distributed like this. What will be interesting to see is how this distribution correlates with quality. Would the panel of experts prefer a stronger wine? Personally I think higher alcoholic content makes for a worse wine, but I am not an expert wine taster, so that does not matter.
The low variance and high granularity of quality make it difficult to see relationships between other variables and quality, so I am creating a bucket variable to encapsulate the ratings into “gross”, “meh”, and “yummy”, with each bucket corresponding to ratings 3-4, 5-6, and 7-8 respectively. These buckets cover the full range of quality found in the dataset and will hopefully make for more visually apparent bivariate plots. The distribution for this new ratings variable can be seen here:
The distribution of rating gives similar information to the quality variable and, I would argue, is more useful for answering the question of how good a certain red wine is. Reducing the number of choices is a better indicator in some instances for informing decisions such as “Would you drink this wine?”, as 1-10 or 1-5 scales seem to be too granular given the distribution of the quality variable. For further analysis, the quality variable will still be used, but it may be helpful to additionally overlay the rating to give a more visually apparent distribution of wine quality in relation to other variables.
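The bucketing described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the code used for this report: the function name `rating_bucket` is hypothetical, the sample scores are made up, and treating anything below 3 as “gross” or above 8 as “yummy” is my own extension, since no such scores occur in the data.

```python
from collections import Counter


def rating_bucket(quality: int) -> str:
    """Map an integer quality score to one of the three rating labels."""
    if quality <= 4:
        return "gross"   # observed scores 3-4 (lower scores assumed "gross" too)
    if quality <= 6:
        return "meh"     # observed scores 5-6
    return "yummy"       # observed scores 7-8 (higher scores assumed "yummy" too)


# Hypothetical quality scores, made up for illustration.
# These are NOT rows from the wine dataset.
sample_quality = [5, 6, 5, 7, 3, 6, 8, 5, 4, 6]

rating_counts = Counter(rating_bucket(q) for q in sample_quality)
print(rating_counts)
```

Keeping the mapping in one small function means the cut points (4 and 6) live in a single place if the buckets ever need to be redrawn.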
There are 1599 observations of 11 chemical property variables, plus the quality score. All 11 chemical properties are continuous, while quality is a discrete integer rating.
The main feature of interest is quality. Without much domain knowledge of wines, it’s hard to rank other features as being of more interest than another.
Other features worth looking at are the various types of acidity, alcohol content, and maybe SO2 content. In my mind, these “groups” of features will probably reveal some interesting relationships with quality, although I can’t say for sure just yet.
I created a “rating” bucket variable which encapsulates ranges of the quality ratings. Quality consists of discrete integers, of which only a few are seen, so it made sense to create this bucket variable so that relationships are easier to discern in the bivariate and multivariate analysis without completely disregarding these data.
Some distributions were log scaled for easier viewing because they are positively skewed but, for the most part, no modifications to the data itself were necessary.
Our interest in these bivariate plots is mostly with respect to quality and not necessarily the relationship of two properties to each other. That said, if there are any interesting correlations between properties, we will look at those too. To start, we will look at an overview of some correlation coefficients for quality to see if there are any areas to immediately start looking at. Subsequently, we can return to some of the groupings we explored in the last section.
The correlation matrix reveals some interesting facts about these variables’ relationships to quality. Quality does not seem to have many strong correlations. Our estimation that sugar and salt were not going to be good estimators of quality is supported by their respective correlation coefficients of 0.01 (almost no correlation!) and -0.13. These two correlations are particularly low, even with respect to a mostly low correlation matrix.
The variables related to acidity, while not having the strongest correlations with quality itself, boast some of the higher correlations with one another, such as 0.67 between citric acid and fixed acidity. Volatile acidity has a relatively strong negative correlation with quality, which matches our intuition that wines with higher acetic acid probably performed worse. Overall, the various acidity measures seem like a good area to explore.
Alcohol content also has a relatively high correlation with quality at 0.48.
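For reference, coefficients like those quoted above are commonly Pearson correlations, although the report does not state which method was used, so that is an assumption. Below is a minimal Python sketch of the calculation. The function name `pearson_r` is hypothetical and the alcohol and quality values are made up for illustration, so the printed number is not a result from this dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical values, made up for illustration (NOT rows from the dataset).
sample_alcohol = [9.0, 9.5, 10.0, 11.0, 12.5]
sample_quality = [5, 5, 6, 6, 7]

sample_r = pearson_r(sample_alcohol, sample_quality)
print(f"r = {sample_r:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates little linear relationship, which is why 0.01 for sugar reads as “almost no correlation”.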
Based off of just these correlations, we can build an intuition for a “yummy” wine profile as such:
This is obviously a gross overgeneralization, but it gives us some direction for exploration via bivariate plots.
Through these histograms, we can visually start to confirm our intuition of quality wine profiles. The first plot shows a higher proportion of yummy wines as alcohol percentage increases. The second plot demonstrates one of our strongest correlations, as lower amounts of acetic acid result in a much greater proportion of yummy wines. For the third plot of citric acid content, our prior intuition of its distribution in relation to quality was vague but, interestingly enough, this plot shows a very slight bimodal distribution of yummy wines. Some wines with very low citric acid content are rated highly, while relatively higher amounts tend to result in more yummy wines. To quantify this further, the median value of citric acid is 0.26, which serves as a rough dividing line between citric acid content and yummy wines. Finally, higher amounts of sulphates do seem to make up a greater portion of yummy wines, but not overly so.
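The “rough dividing line” idea can be checked numerically by comparing the share of yummy wines on either side of a threshold. The Python sketch below is illustrative only: the helper `yummy_share` is a hypothetical name, and the (citric acid, rating) pairs are made up rather than taken from the dataset. Only the 0.26 threshold comes from the summary statistics above.

```python
def yummy_share(wines, threshold):
    """Share of "yummy" wines at or below, and above, a threshold.

    `wines` is a list of (value, rating) pairs. A share is None when
    its group contains no wines.
    """
    below = [rating for value, rating in wines if value <= threshold]
    above = [rating for value, rating in wines if value > threshold]

    def share(ratings):
        return ratings.count("yummy") / len(ratings) if ratings else None

    return share(below), share(above)


# Hypothetical (citric acid in g/dm^3, rating) pairs, made up for illustration.
sample_wines = [
    (0.00, "meh"), (0.05, "gross"), (0.10, "meh"), (0.20, "yummy"),
    (0.30, "meh"), (0.40, "yummy"), (0.45, "yummy"), (0.50, "meh"),
]

CITRIC_ACID_MEDIAN = 0.26  # median citric acid from the summary table

share_below, share_above = yummy_share(sample_wines, CITRIC_ACID_MEDIAN)
print(f"share yummy at or below median: {share_below}")
print(f"share yummy above median:       {share_above}")
```

Running the same comparison on the real data for alcohol, volatile acidity, and sulphates would put numbers on what the histograms show visually.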
The correlations plot also revealed some interesting correlations outside of quality that seem worth exploring. The nature of these relationships could tell us more about how wine qualities change with respect to each other. One notable correlation is between density and fixed acidity with a correlation of 0.67 which is very high with respect to the other observed correlations. Looking at a plot of these variables we can see the following distribution:
There seems to be a noticeable positive linear correlation between density and fixed acidity. The nature of this relationship is not immediately obvious. Per the attribute description, density seems to be related to percent alcohol and sugar content, with no mention of acidity. Fixed acidity is the presence of non-volatile acids such as tartaric acid, so it is not unreasonable to assume that more acids could result in a slightly higher density. Density does not seem to vary much, as most values are in a small range of 0.995-0.9975 g/cm^3.
Revisiting the correlations plot, fixed acidity is also well correlated with several more variables including citric acid and pH:
pH level goes down as fixed acidity goes up, which is expected as lower pH is indicative of higher acidity. The relationship seems to be monotonically decreasing for the most part, with most wines at a pH of around 3.3-3.4 and a fixed acidity of around 7 g/dm^3. What’s interesting is the presence of some outliers with very low pH below 3, which to me would taste like battery acid. My guess is that those don’t taste very good, so it may make sense to look at this plot with respect to quality as well in our multivariate analysis. As for fixed acidity versus citric acid, there seem to be fairly low amounts of citric acid overall, which is to be expected. Since increases in citric acid are somewhat correlated with increases in fixed acidity, it may be the case that citric acid is a non-volatile acid.
As discussed in the “yummy” wine profile, there are some slight correlations that seem to exist between quality and
Citric acid had relatively high correlations with a few variables including fixed and volatile acidity, pH, and density.
Volatile acidity had a relatively strong negative correlation with wine rating. However, what is interesting to me is the fact that high amounts of acetic acid do not necessarily mean that the wine will be gross and not just average. Based on the attribute description, it seemed like an expert wine taster would have a higher sensitivity to it in both directions but it is clearly unidirectional.
From our previous analysis, we identified several factors that seem to be good predictors of wine quality. The strongest two seem to be alcohol content and volatile acidity. Through various multivariate plots, we can attempt to further define more complicated relationships between these two variables with others as we continue to build/reinforce a yummy wine profile. Pairings of variables that create distinct rating clumps will best help for finding this profile.
Let’s first take a look at some multivariate alcohol plots:
Citric acid seems to introduce a fair amount of variance compared to other variables. Citric acid is also one of the few variables in this dataset that does not have a normally distributed, low-range count. Its bimodal distribution is not very apparent, but we can still see a clump of good wines with alcohol content between 9-11% and citric acid greater than its median of 0.26. Additionally, really low citric acid amounts have a clump of gross wines associated with them. With pH, we can also see that low pH and low alcohol content make it less likely to have a good wine. The volatile acidity plot seems to show some pretty distinct clumps for each wine rating, with good wines typically having low volatile acidity across most alcohol content percentages, whereas lower-rated wines have higher amounts of volatile acidity. In fact, most wines seem to have low alcohol content and they are mostly meh as well. Maybe this could be indicative of a more ‘watered-down’ wine, although we can’t say for sure based on this plot.
Next we will look at volatile acidity again to see if we can find some variance against other variables.
The above two plots show volatile acidity dominating as a determinant for yummy wines, as seen by the concentrated clusters against citric acid and pH. Recall from the bivariate analysis that there was a similar observation for citric acid with respect to fixed acidity. With ratings overlaid, we have that plot as such:
Given the relationship of fixed acidity to volatile, it is no surprise that they seem to correlate with citric acid in opposite directions. Most good wines still seem to have higher amounts of citric acid regardless of fixed acidity level.
It may be better to examine some variables distinct from alcohol and acidity. The problem that I expect to encounter, however, is that, as shown by our univariate analysis, most of these variables have a small range and are normally distributed about their respective means. This does not help for rating classification purposes, as all the wines will fall in a small range for the most part. For example:
Here it is very unclear as to what the relationship is from a ratings perspective as it seems each “yummy” point is dispersed randomly about the distribution.
One unexplored area is the amount of sulphates in relation to the amount of free-floating SO2 molecules in wine. We can try taking a look at this by examining sulphates in comparison to the proportion of free-floating SO2 relative to total SO2, since looking at either independently isn’t very useful. The distribution:
The sulphates do not seem to vary very much for the majority of wine but there does seem to be a vertical boundary line that separates yummy wines from the majority of meh ones. I would venture to guess that there is some sort of “threshold” effect here where a certain amount of sulphates
Volatile acidity and alcohol created distinct rating areas in their scatterplot that were hard to achieve with other variables. These two probably contribute the most towards a wine’s quality.
Citric acid’s unique distribution created some strange interactions when plotted against other features due to its bimodal distribution that is proportional to higher quality wines.
This plot shows both lower amounts of acetic acid resulting in more quality wines and a well-defined range of alcohol content of 11-13% for the majority of good wines. For volatile acidity, there is almost a clear visual boundary that defines a threshold below which the volatile acidity level is low enough for a good wine. Wines low in alcohol content or high in volatile acidity are almost guaranteed not to be good wines. The optimal range for alcohol content is above both the median value of 10.20% and the mean of 10.42%.
Fixed acidity is not well correlated with quality, with just a 0.12 correlation coefficient. This multivariate graph shows a trend that increasing fixed acidity amounts in conjunction with a small amount of citric acid results in an overall higher quality wine. Conversely, we can observe that fixed acidity below 7 g/dm^3 and citric acid amounts less than 0.26, its median value, result in a large number of “meh” or “gross” wines.
This plot is interesting in that it shows a few things at once. For one, it demonstrates a counterintuitive relationship between sulphates, a wine additive that can contribute to SO2 presence in wine, and the proportion of free-floating SO2 molecules. Sulphate addition has very little impact on the variance of the proportion, but at a sulphate level of around 0.6 we start to see good wines appear across many of the higher proportions of free SO2. According to the SO2 attribute description, this is actually expected, as SO2 is hard to detect in small quantities until a certain amount is added, at which point it becomes detectable and contributes to the smell and taste of the wine. This threshold is fairly distinct in this graph, as it is most likely the point at which we see yummy wines start to appear in the plot.
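The proportion plotted against sulphates here is simply free SO2 divided by total SO2 for each wine. A minimal Python sketch of that derived variable follows. The function name, dictionary keys, and sample wines are all hypothetical and made up for illustration, not rows from the dataset.

```python
def free_so2_fraction(free_so2, total_so2):
    """Fraction of a wine's total SO2 that is free-floating."""
    if total_so2 <= 0:
        raise ValueError("total SO2 must be positive")
    return free_so2 / total_so2


# Hypothetical wines, made up for illustration (NOT rows from the dataset).
sample_wines = [
    {"sulphates": 0.55, "free_so2": 7.0, "total_so2": 22.0},
    {"sulphates": 0.62, "free_so2": 14.0, "total_so2": 38.0},
    {"sulphates": 0.73, "free_so2": 21.0, "total_so2": 62.0},
]

# One (sulphates, free SO2 fraction) point per wine, ready for a scatterplot.
points = [
    (wine["sulphates"], free_so2_fraction(wine["free_so2"], wine["total_so2"]))
    for wine in sample_wines
]

for sulphates, fraction in points:
    print(f"sulphates={sulphates:.2f}  free SO2 fraction={fraction:.3f}")
```

Because free SO2 is part of total SO2, the fraction should fall between 0 and 1, which puts every wine on a common scale regardless of how much SO2 it contains overall.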
This analysis presented numerous difficulties and successes. The narrow, normal distributions of many of these continuous variables seemed to stifle their usefulness to me as I was, and still am, unsure of how to analyze them adequately. Additionally, the multivariate analysis was difficult because I had limited domain knowledge to ask the right questions to come up with some useful plots. I felt like I was at times brute forcing my way through combinations of variables because I couldn’t come up with interesting questions compared to, say, working with the pseudo-Facebook data. Working around the limitations of the quality variable via buckets was a success for me, as I spent quite a while trying to figure out what to do with such an odd distribution of scores. The ratings variable opened up numerous avenues for analysis to indirectly gauge quality.
For future work, I would like to build a linear regression model and potentially do some feature importance selection through various methods to further guide my thinking and exploration of this dataset. I believe some transformations of the data might be in order for continued exploration but I am unsure of what those might be at this time.